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ABSTRACT 

Many enhancers regulate their target genes via long- 
distance interactions. High-throughput experiments 
like ChlA-PET have been developed to map such 
largely cell-type-specific interactions between c/s- 
regulatory elements genome-widely. In this study, we 
integrated multiple types of data in order to reveal the 
general hidden patterns embedded in the ChlA-PET 
data. We found characteristic distance features re- 
lated to promoter-promoter, enhancer-enhancer and 
insulator-insulator interactions. Although a protein 
may have many binding sites along the genome, 
our hypothesis is that those sites that share cer- 
tain open chromatin structure can accommodate rel- 
atively larger protein complex consisting of spe- 
cific regulatory and 'bridging' factors, and may be 
more likely to form robust long-range deoxyribonu- 
cleic acid (DNA) loops. This hypothesis was vali- 
dated in the estrogen receptor alpha (ERa) ChlA- 
PET data. An efficient classifier was built to pre- 
dict ERa -associated long-range interactions solely 
from the related ChlP-seq data, hence linking dis- 
tal ER<x -dependent enhancers to their target genes. 
We further applied the classifier to generate addi- 
tional novel interactions, which were undetected in 
the original ChlA-PET paper but were validated by 
other independent experiments. Our work provides 
a new insight into the long-range chromatin interac- 
tions through deeper and integrative ChlA-PET data 
analysis and demonstrates DNA looping predictabil- 
ity from ordinary ChlP-seq data. 



INTRODUCTION 

Many distant enhancer elements in the human genome reg- 
ulate their target genes through long-range deoxyribonu- 
cleic acid (DNA) looping interactions (1-4). Such long- 
range interactions are often related to 3D chromatin con- 
formations that are important for gene regulation in specific 
cell types (5-7). In general, these chromatin interactions can 
be roughly grouped into two different types: one is closely 
related to gene regulation and dynamically changes during 
development or in response to external stimuli (8,9) and the 
other plays more of a structural role, forming non-tissue- 
specific chromosome conformations (10). 

To biochemically detect how and where these long-range 
interactions occur, chromatin conformation capture (3C) 
(1 1) and related methods, such as 4C (12) and 5C (13), have 
been developed. These methods are well suited for studying 
targeted local chromatin regions. Recently, genome-wide 
high-throughput techniques have been applied to detect 
large numbers of multiple interacting regions at the same 
time, which can delineate a global landscape of long-range 
chromatin interactions. Two of the best known methods 
are Hi-C (14) and ChlA-PET (15). Hi-C is directly derived 
from 3C, which sequences all the interacting DNA frag- 
ments with biotin-marked ligation junctions. It can detect 
substantial interactions simultaneously and is not restricted 
to the type of protein that 'bridges' these interactions. How- 
ever, Hi-C can only provide a relatively low-resolution in- 
teraction map (~1 Mbp), which is unsuitable for studying 
interactions between specific ds-regulatory elements, such 
as enhancer-promoter loops. ChlA-PET combines ChIP 
and 3C together, which enriches crosslinked DNA-protein 
complexes using an antibody against the protein of inter- 
est and uses proximity ligation to collect interacting DNA 
fragments tethered by the ChlP-ed protein. It has a higher 
resolution (similar to ChlP-seq) than Hi-C, but is limited 
to detecting interactions mediated by only one type of pro- 
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tein per experiment. Thus, Hi-C and ChlA-PET provide us 
with global view of different aspects of the chromosomal 
contact structure. However, they both require very exten- 
sive sequencing depth and are often affected by noise from 
random contact of DNA fragments in solution. 

We decided to integrate genome- wide transcription fac- 
tor (TF) binding and histone modification profiles in order 
to better understand ChlA-PET data for two reasons. First, 
it has been reported that active enhancers are most often 
bound by multiple TFs, which form large protein complexes 
to link enhancers to promoters (16,17). This may generate 
a different epigenetic pattern compared with simple pro- 
tein binding sites. Second, if two distant TF binding sites 
are interacting with each other, ChlP-seq experiments will 
likely detect both peaks in these two regions (18), and this 
phenomenon can partially explain the fact that some TF 
binding sites do not contain the canonical motif. We can 
infer that those regions pulled down by the same antibody 
are more likely to form interactions. These facts imply that 
aligning multiple TF binding sites (e.g. according to ChlP- 
seq and DNase-seq data) will be informative for predicting 
physical interactions between functional ds-regulatory ele- 
ments. 

We started from MCF7 estrogen receptor alpha (ERa) 
ChlA-PET data and combined gene expression, TF bind- 
ing, histone modification profiles and open chromatin con- 
formation data to uncover hidden features buried beneath 
the long-range interactions. Firstly, invariant distance fea- 
tures from different ChlA-PET data sets were extracted to 
characterize promoter-promoter (PP), enhancer-enhancer 
(EE) and insulator-insulator (II) interactions. Secondly, we 
determined the genetic and epigenetic features that could 
discriminate loop-associated and non-loop-associated ERa 
binding sites (ERBSs). Our analysis demonstrates that 
ERBSs associated with loop formation, especially those 
that contain an estrogen response element (ERE), are more 
likely to be nucleosome depleted, which is a surprise as some 
previous study (19) reported that ERa could bind to mi- 
crosomes directly. Assembly of co-factors, such as FoxAl, 
GATA3 and AP27 , and general co-activator p300 and ERa 
complex appear to require such chromosome conformation 
with higher DNA accessibility. Lastly, we used these fea- 
tures to predict loop-associated ERBSs (laERBSs) and to 
develop a DNA looping prediction algorithm. This allowed 
us to recover novel ERa-mediated interactions that were 
missed from the original ERa ChlA-PET paper (15). Many 
known ERa-regulated genes that were not found in the orig- 
inal ChlA-PET paper were identified by our method, and 
some are supported by other independent 3C data or Pol2 
ChlA-PET data. To our knowledge, this is the first success- 
ful attempt to use multiple ChlP-seq data to predict long- 
range chromatin interactions, thus could serve as a comple- 
ment to the complicated and costly ChlA-PET experiments. 

MATERIALS AND METHODS 

Data sources 

ChlA-PET data of ERa from E2 (17p-Estradiol) induced 
MCF7 cell was obtained from (15), where the P- value of 
each Paired-End Tag (PET) cluster was given; Pol2 and 
CTCF data from MCF7 and K562 cells were obtained 



from ENCODE project (20), and the coordinates of data 
were changed to hgl8 with the lift-over tool. The inter- 
chromosomal interactions were not considered in our anal- 
ysis as they were only composed of a very small proportion 
of the interactions (a few tens) and were not enough for sta- 
tistical analysis. ChlP-seq data of ERa, Pol2, H3K4mel, 
H3K4me3, H3K9ac, H3K27me3 and Input control from 
E2-induced MCF7 cells were from (21); ChlP-seq data of 
H3K4me2 and DNase-seq data from E2-induced MCF7 
cells were from (22); FoxAl and AP27 data from E2- 
induced MCF7 cells were from (23); GATA3 and p300 data 
from E2-induced MCF7 cell lines were from (24). GRO- 
seq data for E2-induced MCF7 cells after 40 min were 
from (25). E2-induced differential expressed genes were ob- 
tained from the supplementary data of (26). The related 
Gene Expression Omnibus (GEO) accession numbers are 
GSE11352, GSE18046, GSE23701, GSE23852, GSE29073, 
GSE33216, GSE39495, GSM678539 and GSM678540. 



Binding sites detection 

Binding sites from ChlP-seq were called by Model-based 
Analysis for ChlP-Seq (MACS) (27) with default parame- 
ters for ERa, FoxAl, GATA3, AP27, Pol2 and p300. En- 
richment regions for histone modifications were called by 
broad peak options with parameter setting '-nomodel - 
nolambda' as suggested by Feng et al (28). 



EE, PP and enhancer-promoter interactions classification 

Distance between two ChlP-seq binding peaks was calcu- 
lated as the genomic distance between their summits called 
by MACS (27). Distance between two genomic regions was 
defined as the genomic distance between their mid-points. 
Density estimation was called by the R density function 
with default parameters. Peak position of a uni-modal dis- 
tribution was chosen as the point with the highest density 
value. For multi-modal distribution, we first fitted the data 
with a Mixture Gaussian Distribution by R package mix- 
tools (29) and chose the estimated expectation of each com- 
ponent as peak position. 

We used log2 -transformed read-count ratio between 
H3K4me3 and H3K4mel ChlP-seq data in the Pol2 bind- 
ing sites to classify Pol2 ChlA-PET interaction clusters into 
three types: EE interactions, PP interactions and enhancer- 
promoter (EP) interactions. Windows for computing read 
counts were selected as ±1.5-kb region around the peak 
summits. Binding sites whose summits were closer than 1.5 
kb were merged and the mid-point of merged regions was 
chosen as the new peaks' 'pseudo summit'. The data were 
then fitted with Mixture Gaussian Model and a Bayesian 
posterior probability of 0.5 was set to determine whether a 
Pol2 binding site was promoter-like or enhancer-like. PET 
clusters with both ends classified as enhancer-like were 
called EE interactions, both promoter-like were called PP 
interactions and the rest were called EP interactions. For 
ERa ChlA-PET data, we applied this classification in those 
interactions that overlapped with H3K4mel or H3K4me3 
peaks at both ends. The subsequent steps were similar to 
those of Pol2 ChlA-PET data. 
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Histone modification, DNase-seq and GRO-seq profiles sur- 
rounding ChlP-seq binding peaks 

Each tag was extended by 200 bp along the read direction. 
We selected peak summit as the center for signal alignment. 
Each region was divided into 25-bp sub-regions and its read 
coverage was computed as the read counts covering this 
sub-region. The difference of read coverage between ChIP 
data and input data was computed as the log2 -transformed 
read-coverage ratio at each sub-region. DNase-seq data 
were not extended and only the 5' end was used as aggre- 
gation inputs. GRO-seq data were divided into two groups 
according to the strand of the reads and then processed in 
the same manner as histone modification data. 

To classify whether an ERBS was an ERE-containing 
one, we used STORM (30) to scan ±200-bp region around 
the summit of ERBS for ERE sequence motif, with P- value 
le-4. 



Model to predict laERBSs 

ERBSs were extracted from ChlP-seq data. To predict 
laERBSs, three types of features were computed: 

(i) Fif. loge -transformed read counts for ChlP-seq data of 
protein j (or DNase-seq data) in ERBS /, with a win- 
dow size of 400 bp centered around ERBS peak sum- 
mit; 

(ii) D{. loge-distance between the neighboring ERBS of 
ERBS i and ERBS /; 

(iii) Hi/, the differences between log2 -transformed ratio of 
read-coverage against input in central region (±100- 
bp region relative to the peak summit) to average of 
read-coverage against input in the two flanking regions 
(-400 bp to -200 bp and +200 bp to +400 bp relative 
to peak summit) of ERBS i for histone modification j. 

We selected the 1822 ERBSs that related to 903 interac- 
tions identified in both two ChlA-PET experimental repli- 
cates as foreground training set. An equal number of ERBSs 
that did not overlap with any interactions in either replicate 
were randomly selected as background training set. We used 
a logistic classifier to perform the classification, i.e. 

m = i) = 

{1 + exp(-fc 0 - ^ a Uj F Uj - biD t - ^ c Uj H Uj )y\ 

in which E t is the indicator whether ERBS i is a laERBS, 
P is a probability function and ko,ciij,bi,Cij are the model 
parameters. 

After training, we chose the most significant features that 
showed improvement over ERa ChlP-seq read counts in 
ERBSs to fix the final classifier. Then ERBSs were filtered 
by setting the threshold of the logistic classifier to be 0.2 
to find putative laERBSs. Summits of these laERBSs closer 
than 3 kb were merged and the mid-point was selected as 
the new 'pseudo summit'. This resulted in ~15 000 candi- 
date anchors. 



Model to predict ERBS interaction clusters 

To predict ER-associated interactions, the candidate an- 
chors were paired with each other to form candidate ERBS 
pairs. Those pairs that crossed topological domain bound- 
aries in hi -ESC were excluded since domain boundaries 
were roughly invariant across cell lines, and long-range 
interactions were largely restricted within topological do- 
mains (31). To predict interactions, two types of features 
were computed for each candidate pair: 

(i) PF hi2 j = F h j + F i2 f. the sum of log e -transformed jth 
ChlP-seq (or DNase-seq) read counts for each candi- 
date ERBS pair (hJi), with a window size of 3 kb for 
each end (±1.5-kb region relative to the peak summit); 

(ii) PD ilh j\ the log e -distance and inverse distance between 
each ERBS pair (/i,/2), since previous study (32) sug- 
gested that probability of long-range interaction was 
not monotonically related to genomic distance. 

Eight hundred ERBS interactions (out of 903 interac- 
tions identified in both two ChlA-PET experimental repli- 
cates) with both ends restricted within the same topologi- 
cal domain were selected as foreground training set. Equal 
number of candidate ERBS pairs that did not overlap with 
any ERa interactions identified in either experimental repli- 
cate was randomly selected background training set. We still 
used a logistic classifier to perform the classification, i.e. 

P(PE hh = 1) = 

{1 + exp(-fc 0 - ^2 <*hi 2 J PF hi2j ~ ^2j b hhJ PD hhj)r^ 

in which PE ili2 is the indicator whether ERBS pair (z'1,/2) 
formed an interaction, P is a probability function and 
ko, ahh,j, bhi 2 j are the model parameters. 

Performance of the classifier was evaluated by a 5 -fold 
cross-validation; AUC (area under curve) and ROC (re- 
ceiver operating characteristic) were plotted as the average 
of each 5-fold cross-validation. 

To predict novel ERBS-ERBS interactions, we trained 
the model parameters on the whole training set and applied 
it to the remaining candidate ERBS pairs. Predicted interac- 
tions that were not reported in either replicate were further 
analyzed. 



Model to predict promoter-ERBS interaction clusters 

To predict new promoter-ERBS interactions, we first found 
out those promoters containing at least one of the FoxAl, 
GATA3 or AP27 binding sites but not ERBSs from RefSeq- 
annotated promoters (±1500-bp region surrounding the 
transcription start site (TSS) and alternative TSSs within 
1.5 kb were merged). Those features used for predicting 
ERBS-ERBS interactions were computed by replacing dis- 
tance between two ERBSs with distance between promoter 
and ERBS. 
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Figure 1. Characteristic distance features of ChlA-PET data. (A) Logio- 
distance distribution between the two ends of chimeric and non-chimeric 
PETs, where non-chimeric PETs presented a trimodal distribution. The 
third peak at right of non-chimeric PETs was close to that of chimeric 
PETs. (B) Log io -distance distribution between the two ends of PETs 2+ 
clusters of MCF7 ERa ChlA-PET data for two replicates. Interactions 
with span more than 1 Mbps were excluded. Peak positions located at 3.8 
and 4.77. (C) Logio-distance distribution between the two ends of PETs 
2+ clusters from MCF7-specific and K562-specific Pol2 ChlA-PET data. 
Peak positions located at 3.7 and 4.64. (D) Logio-distance distribution of 
EE, PP and EP interactions from MCF7 Pol2 ChlA-PET data. Positions 
of peaks for EE and PP interactions were 3.7 and 4.8, respectively. 



RESULTS 

Invariant distance distributions are associated with different 
types of interactions 

We collected the ChlA-PET data for ERa (15), Pol2 (33) 
and CTCF from the GEO data sets (GEO accession num- 
bers GSE 18046 and GSE39495) and ENCODE project (20) 
for this study. We first analyzed the distribution of the dis- 
tance between the two ends of chimeric and non-chrimeric 
PETs (34). Raw PETs of ERa ChlA-PET data from MCF7 
cells presented a trimodal distribution (Figure 1A). The 
third peak at the right, which was mainly composed by sin- 
gle PET, was similar to that of chimeric PETs that rep- 
resented randomly paired interaction sites. This suggested 
that PETs with a long span (>1 Mbps) were more likely 
to be derived from random DNA contact noise in solu- 
tion or some non-specific interactions. Thus, they should 
be filtered more carefully. In the following analysis, we re- 
moved all PETs that had a span larger than 1 Mbps and 
only considered the interactions that could be supported by 
no less than two PETs (PETs 2+ clusters) instead of thresh- 
old (no less than three PETs without distance restriction) 
used in (15). This is because that majority of PETs 2+ clus- 
ters have a span less than 1 Mbps, which is well separated 
from the chimeric PETs. Setting a higher threshold will lose 



many true positives and make the selection of background 
less reliable. Interaction clusters from both ERa and Pol2 
data presented bimodal distribution patterns (Figure IB 
and C). For the Pol2 data, we used H3K4me3/H3K4mel 
log2 read-count ratio to classify the PETs 2+ clusters into 
three groups (33), i.e. EE interactions where both ends of the 
clusters showed higher H3K4mel signals, PP interactions 
where both ends of the clusters showed higher H3K4me3 
signals and EP interactions for the rest (see details in the 
Materials and Methods section). It was clear that the two 
peaks from Pol2 data corresponded to the characteristic dis- 
tances ~5 kb and ~60 kb for EE interactions and PP in- 
teractions, respectively (Figure ID). And we found that a 
great percentage of EE interactions are indeed within the 
flanking regions (±10 kb) of the same gene (Supplemen- 
tary Figure SI). We removed common interactions between 
MCF7 and K562 cells to study the cell-type specificity of EE 
and PP interactions. Both of the two peaks were present in 
MCF7 unique and K562 unique interactions (Figure 1C). 
ERa data displayed a similar, albeit less pronounced, char- 
acter. The bimodal pattern of distance distributions of chro- 
matin interaction clusters was largely conserved across dif- 
ferent cell types (in both MCF7 and K562 cells; Figure 1C), 
and thus appeared to be an invariant feature embedded in 
the typical promoter interacting TF ChlA-PET data. 

We also checked II interactions using CTCF ChlA-PET 
data. CTCF binding sites and interactions that overlapped 
with H3K4mel or H3K4me3 peaks at either end were fil- 
tered out to avoid promoter- or enhancer-associated inter- 
actions. The logio-distance distribution of the rest CTCF 
interaction clusters had a unimodal pattern (Supplemen- 
tary Figure S2A). The peak position was roughly invariant 
between MCF7 (160 kb) and K562 (200 kb) cells (Supple- 
mentary Figure S2A), and was roughly 5-fold as large as 
the median distance of the neighboring CTCF binding sites 
(Supplementary Figure S2B). This was consistent with the 
notion that CTCF regulation is largely conserved between 
different cell types (35). Thus, II interactions share a dif- 
ferent distance preference, which is also largely invariant 
across different cell lines. 

laERBSs are more likely to be nucleosome depleted 

To identify genetic or epigenetic features that could dis- 
criminate ERBSs involved in loop formation from those 
solo ones, we carried out integrative analysis of multiple 
(TF and histone modification) ChlP-seq data sets from the 
same cell line (MCF7). Firstly, we analyzed histone modi- 
fication patterns around laERBSs and non-loop-associated 
ERBSs (nlaERBSs). Here laERBSs were selected as ERBSs 
overlapping with PETs 2+ clusters identified in both two 
ChlA-PET experimental replicates (903 clusters). To align 
histone modification signals around ERBSs, we selected the 
strongest summit of ERBS as the center if more than one 
ERBS was contained in the same end of an interaction clus- 
ter, resulting 1203 ERBSs as the foreground set. We ran- 
domly selected an equal number of ERBSs that were not 
associated with any interaction clusters identified in either 
ChlA-PET experimental replicate as the background set. 
It was noticed that laERBSs showed subtle depletion of 
histone modification signal in the center compared with 
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Figure 2. Histone modification and DNase-seq signal profile for four 
groups of ERBSs. Average H3K4mel (A), H3K4me3 (B) and H3K9ac (C) 
read-coverage against input surrounding laERBSs with and without an 
ERE and nlaERBSs with and without an ERE. Average DNase-seq tag 
counts surrounding laERBSs with and without an ERE and nlaERBSs 
with and without an ERE (D). 



nlaERBSs (Supplementary Figure S3 AC). laERBSs and 
nlaERBSs were further divided into two groups by the pres- 
ence or absence of an ERE. Surprisingly, we found that 
a significantly more proportion of laERBSs with an ERE 
shown depleted histone modification signals in the center 
(Figure 2A-C, Supplementary Figure S4 and Supplemen- 
tary Table SI) than the other three groups. DNase-seq data 
confirmed that laERBSs were more frequently located in 
open chromatin regions (Supplementary Figure S3D), es- 
pecially those that contained an ERE (Figure 2D and Sup- 
plementary Figure S5). Since only about 10% laERBSs lo- 
cated in promoter regions, our observation could not be ex- 
plained by nucleosome depletion in active promoter regions 
(Supplementary Figure S6). Therefore, although to the best 
of our knowledge, there were no direct genome-wide nu- 
cleosome positioning data available in E2-induced MCF7 
cells, our integrative analysis of multiple histone modifica- 
tions and DNase-seq data suggested that nucleosomes were 
more likely to be depleted in laERBS. In fact, a previous 
paper (22) reported that ERa binding to DNA was not sen- 
sitive to nucleosomes; while on the contrary, we found that 
those laERBSs with an ERE showed significant nucleosome 
depletion in the center. This indicated that there might be 
some other co-factors that can bind to these laERBSs co- 
operatively to facilitate long-range interactions and be re- 
sponsible for nucleosome eviction. 

To test this hypothesis, we conducted an enrichment 
analysis against the ChlP-seq data of three known ERa's 
co-factors, and the general co-activator p300. More than 
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FoxA1 GATA3 AP2gamma p300 

Figure 3. Proportion of FoxAl, GATA3, AP27 and p300 binding peaks 
that overlapped with four groups of ERBSs. P-values smaller than 0.05 
were listed above the bars, which were given by one-tailed Fisher test in R. 
***P< 2.2e-16. 



80% laERBSs (either with an ERE or not) overlapped 
with FoxAl, GATA3 and AP27 binding sites (Figure 3). 
The percentages were significantly higher than those of 
nlaERBSs (P < 2.2e— 16; Supplementary Table S2), indi- 
cating that these three TFs are likely in the complex me- 
diating ERa-associated long-range interactions. Reported 
RNAi knock-down experiments also support the notion 
that these three co-factors are important for ERa loop for- 
mation (23,24). Thus they might be associated with nu- 
cleosome eviction in laERBSs. For p300, the percentage 
of overlapping with laERBSs was not as high as that of 
those three co-factors. However, interestingly, p300 was sig- 
nificantly enriched in laERBSs with an ERE in compar- 
ison with laERBSs without an ERE (P = 7.59e-4), but 
showed no difference between nlaERBSs with an ERE and 
nlaERBSs without an ERE (P = 0.72), while none of the 
three co-factors have a similar pattern. A previous study 
(36) showed that p300 is a component of an estrogen re- 
ceptor co-activator complex and the formation of the p300 
complex is associated with nucleosome eviction (37). This 
indicated that p300 might be more frequently recruited to 
laERBSs with an ERE for higher level of nucleosome de- 
pletion. In addition, with the GRO-seq data (GEO ac- 
cession numbers GSM678539 and GSM678540) we found 
that laERBSs showed stronger bidirectional transcription 
of small RNAs (smRNAs) compared to nlaERBSs (Sup- 
plementary Figure S7), with the highest transcription level 
occurring in laERBSs with an ERE (Supplementary Figure 
S7A). This is consistent with the notion that the bidirec- 
tional smRNAs transcription may help maintain the open 
chromatin structure in these regions (38). Taken together, 
these observations suggest that laERBSs, especially those 
with canonical EREs, are often associated with an open 
chromatin structure, which is likely resulted by the forma- 
tion of the looping-specific protein complex of ERa, its co- 
factors and p300, and associated with the bidirectional tran- 
scription of smRNAs. 



Clustered ERBSs are more likely to be associated with loops 

Of all the 1299 regions that formed PETs 2+ interaction 
clusters identified in both two ChlA-PET experimental 
replicates, 30% contained more than one ERBS. Statistical 
tests showed the clustered ERBSs within 3 kb were more 
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likely to associate with DNA loops (P < 2.2e— 16, 45% ver- 
sus 20%). 

A logistic classifier can predict laERBSs and associated long- 
range interactions 

We tested the prediction power of the above features ex- 
tracted from previous sections by building a classifier to pre- 
dict ERa-associated long-range interactions based on TF 
and histone modification ChlP-seq data. Given a pair of 
ERa ChlP-seq binding peaks, a powerful classifier should 
be able to judge how likely they are actually forming an in- 
teracting DNA loop. Our predictor was built in two steps. 
First, we used the above features to distinguish laERBSs 
and nlaERBSs. We used 903 PETs 2+ clusters confirmed in 
both two replicates as the positive training samples, since 
they were more reliable than those non-reproducible ones. 
We evaluated the performance of the classifier with 5-fold 
cross-validation and got the average true positive rate (TPR) 
at 74% and average false positive rate (FPR) at 21% at the 
Bayesian threshold (0.5) with only three features, i.e. ERa 
ChlP-seq signals, neighboring ERBS distances and DNase- 
seq signals. AUC for predictor with single feature and ROC 
comparisons for different feature combinations were shown 
in Figure 4A and C, respectively. Although transcriptional 
rate of bidirectional smRNA and the binding intensity of 
p300 showed good predictive power, they are redundant 
with ERa ChlP-seq signals, neighboring ERBS distances 
and DNase-seq signals. And unexpectedly, histone mod- 
ification features did not provide additional information 
over DNase-seq signals for the prediction. We finally set the 
Bayesian threshold 0.2 to contain as many laERBSs (in the 
training set) as possible (97%) while the relatively high FPR 
(61%) can be further dealt with in the next filtering step. 

In the second step, we tried to predict long-range interac- 
tions between the putative laERBSs obtained from the first 
step. Recently, Hi-C data analysis suggests that chromatins 
are organized as large topological domains and interactions 
between regulatory elements are largely restricted within 
these domains (31). As shown in (31), although strengths 
of chromatin interactions between and within these do- 
mains may vary across different cell lines, such domains are 
roughly invariant across different cell types. Thus, we fur- 
ther restricted the two ends of the same interaction cluster 
within the same Hi-C-determined topological domain. This 
criterion only excluded 69 (~8%) PETs clusters from our 
positive set, but reduced more than one half of ERBS pair- 
wise combinations, a large percentage of which are likely to 
be noise (there are 97 202 candidate ERBS pairs by using 
constrain of topological domain while there exist 239 016 
and 6 377 706 possible intra-chromosome ERBS pairs with 
and without restriction of 1-Mb genomic distance, respec- 
tively). With the above restrictions, we revealed 800 PETs 2+ 
interactions between filtered laERBSs within the topologi- 
cal domains, which served as the foreground training set. 
Classifiers with different feature combinations on the train- 
ing set were evaluated by 5-fold cross-validation. The best 
performer achieved a TPR of 93% and an FPR of 8% at 
the Bayesian threshold 0.5. The intensity of ERBSs and the 
distance between candidate pairs acted as the top two sig- 
nificant features to distinguish foreground and background 
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Figure 4. AUC and ROC curves for the predictors. (A) AUC for each 
feature to predict laERBSs. Here AUC was computed as the average 
area under ROC curve for predictors with each single feature for 5-fold 
cross-validation. (B) AUC for each feature to predict interactions be- 
tween predicted laERBSs. AUC was computed in the same manner as in 
(A), "dis" is distance between two ERBSs and "idis" is the inverse distance. 
(C) ROC trained for ERa alone, ERa+distance of neighboring ERBSs, 
ERa+distance of neighboring ERBSs+DNase-seq and all the features. (D) 
ROC trained for ERa alone, ERa+distance and ERa+distance+6 signifi- 
cant features and all the features. 



(Figure 4B). The binding intensity of Fox A 1 and AP27 had 
a similar predictive power, but was somewhat redundant 
with that of ERa. Beyond these features, DNase-seq and 
histone modification signals that were able to describe chro- 
matin accessibility could also improve the performance. Fi- 
nally, eight features (Supplementary Tables S3 and S4) that 
showed significant improvement over the ERa+distance 
combination (Figure 4D) were selected to build the final 
classifier. 

We applied the classifier to predict interactions among 
the 97 202 candidate ERBS pairs. Originally, Fullwood et al 
(15) reported 3527 high confidential interactions that share 
ERa ChlP-seq binding peaks at both ends for the com- 
bined two replicates. Among them, 3113 were restricted in 
the topological domains, 76% (2356) of which could be re- 
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Figure 5. Evaluation of novo predicted interactions by other independent 
data sets. (A) Proportion of reported, predicted and unreported and rest 
of the candidate (non-predicted and unreported) ERBS-ERBS pairs that 
overlapped with Pol2 PETs 2+ and PETs 3+ clusters, respectively. (B) Pro- 
portion of reported, predicted and unreported and rest of the candidate 
(non-predicted and unreported) ERBS-promoter pairs that overlapped 
with Pol2 PETs 2+ and PETs 3+ clusters, respectively. 



called, while only 9% (8805) of the remaining 94 089 un- 
reported pairs were predicted to be ERa-associated inter- 
action clusters. Therefore we conclude that our classifier is 
highly effective in predicting known ChlA-PET interactions 
from multiple ChlP-seq data. 

Many novo predicted interactions are supported by other data 
sources 

Next, we compared our newly predicted 8805 interactions 
with other sources of data to confirm their biological sig- 
nificance. They are significantly overlapping with known 
Pol2 PETs clusters (33) (P < 2.2e-16) (Figure 5A and 
Supplementary Table S5). And the overlapping proportion 
was comparable with that of 31 13 high-confidential ERBS- 
ERBS pairs reported in (15) (Figure 5 A and Supplementary 
Table S5). This observation suggested that our predicted 
novel ERBS interactions might contain a considerate pro- 
portion of true positives that were not captured by the pub- 
lished ChlA-PET analysis (15), probably due to the fact that 
the original ERa experiments were not saturated and many 
interactions were filtered out to reduce false positives. 

New ERa target genes are predicted 

There are thousands of ERBSs in the genome (21), but only 
a small fraction of them (10-15% in different experiments) 
are located in regions proximal to gene TSS (±1500 bp). 
ChlA-PET experiments validated that a considerate per- 
centage of ERBSs (~10%) function via long-range inter- 
actions to their target promoters (15). Here we further ex- 
tended our classifier to predict potential target genes regu- 
lated by the distal ERa-bound enhancers but without direct 
ERa binding at the promoters, which were hard to detect 
with traditional ChlP-seq analysis methods. Candidate pro- 
moters were selected as those that contained at least one of 
the FoxAl, GATA3 or AP27 ChlP-seq binding peaks, but 
not ERBSs. Again, we restricted all those promoter-ERBS 
pairs within the topological domains. The classifier identi- 
fied 507 pairs of ERBS-promoter interactions, associated 
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Figure 6. Genome browser view by IGV tools (42, 43) of predicted loops 
around CTSD and XBP1 gene. (A) predicted interaction between promoter 
of CTSD gene and a -9kb upstream ERBS. (B) predicted interaction be- 
tween promoter of XBP 1 gene and a -13kb upstream ERBS. 



with 374 genes. These predicted interactions were signifi- 
cantly overlapping with Pol2 PETs clusters (P < 2.2e— 16) 
as expected (Figure 5B and Supplementary Table S6), and 
similar to the 113 ERBS-promoter pairs (without ERBS at 
promoters and restricted in topological domains) reported 
in (15) (Figure 5B and Supplementary Table S6). From the 
gene expression data (GSE1 1352), we found those genes up- 
regulated at 4 h, 12 h and 24 h were all significantly enriched 
in the predicted genes than in rest of the candidate genes, 
while the down-regulated genes were not (Supplementary 
Table S7). Interestingly, some well-known ERa target genes 
that have been reported in the literature but not detected in 
the original ERa ChlA-PET study appeared in our predic- 
tion. For example, 3C experiments confirmed that CTSD's 
promoter (Figure 6A) can form a loop with the — 9 kb up- 
stream ERBS containing enhancer (39). CTSD is impor- 
tant for tumor progression and in metastasis and is used as 
a specific biomarker in breast cancer diagnosis (40). XBP1 
(Figure 6B), a TF involved in the unfolded protein response 
(41), is also an estrogen-regulated gene and its expression 
is strongly correlated with ERa expression in breast cancer. 
Sengupta et al (41) have reported that the — 1 3-kb enhancer 
upstream of the XBP1 promoter is an E2-response regula- 
tory element that can functionally regulate XBP1 gene ex- 
pression as we predicted. These observations suggest that 
our classifier can reliably predict ERa target genes regulated 
by distal elements while the common ChlP-seq data analy- 
sis has failed. 



6942 Nucleic Acids Research, 2014, Vol 42, No. 11 



DISCUSSION 

Long-range interaction is an important and complex mech- 
anism that regulates gene expression in space and time. 
With 3C-based technologies, many functional long-range 
interactions can be detected genome-widely. Among them, 
ChlA-PET can provide a more detailed view of whole 
genome interactions for a given TF. Our comprehensive re- 
analysis of ChlA-PET data revealed invariant characteris- 
tic distance features between different regulatory elements. 
Such distance features are largely unchanged across differ- 
ent cell types for EE interactions (5-6 kb), PP interactions 
(60-80 kb) and II interactions (160-200 kb). These charac- 
teristic distance features may reflect the underlying invari- 
ant properties of the structural organization of these regu- 
latory elements in 3D chromatin DNA. 

Our integrative analysis of histone modification and 
DNase-seq data showed that, although some ERa DNA 
binding sites may not be sensitive to nucleosomes, laERBSs 
are often found in open chromatin regions. This implies 
those ERBSs that are involved in specific long-range reg- 
ulatory interactions with their target genes may be strongly 
dependent on local open chromatin structure in order to 
accommodate sophisticated protein complexes through 3D 
DNA looping. We expect that this insight may be generally 
applicable to many other different TFs, where nucleosome 
eviction can act as a predictive mark to distinguish loop- 
associated transcription factor binding sites (TFBSs) from 
solo TFBSs. 

Another sex hormonal receptor, the androgen receptor 
(AR), is a TF that is very similar to ERa, as they both coop- 
erate with FoxAl for binding. A previous paper (22) reports 
that their binding sites have different chromatin accessibil- 
ity patterns: AR favors pre-defined nucleosome-depleted re- 
gions, while ERa does not. However, here we showed that 
laERBSs, especially those with an ERE, have a similar static 
open chromatin structure just like most AR binding sites, 
indicating a more general long-range regulation mechanism 
by nuclear hormone receptors. 

By extracting features from multiple ChlP-seq data sets 
of histone modification and TF binding profiles, we have de- 
veloped classifiers to predict laERBSs and significantly in- 
teracting ERBS pairs. The ERa ChlP-seq signal, distance 
of the neighboring ERBS and DNase-seq signals are the 
most predictive features for laERBS, while other features 
are more or less redundant with these three ones. The re- 
strictions of interactions between predicted laERBSs and 
within Hi-C-defined topological domains have proved to be 
very effective filters. Our final trained logistic classifier can 
not only recover a large percentage (76%) of reported ERa 
interactions but also predict many novel ERBS pairs that 
are validated by independent 3C or ChlA-PET experiments. 
We also applied our model to predict ERBS-promoter in- 
teractions. Some newly predicted ERa target genes, whose 
promoters were not directly bound by ERa and hence un- 
detected by common ChlP-seq analysis, can also be linked 
with the E2-induction process or breast cancer pathways 
through other experimental supporting evidence. This in 
turn validates our model. The method we have described 
here should be applicable to other TFs, such as AR. 



There are still two main potential limitations of our 
model. One is that we used topological domains in HI cells, 
not in MCF7 cells due to lack of the data, to filter ERBS 
pairs before interaction prediction. Although it is reported 
that topological domains are largely conserved between dif- 
ferent cell types and only a small proportion of ERa ChlA- 
PET-detected interactions in MCF7 cells were filtered out. 
Using topological domains detected in the same cell type 
would be able to further improve the performance. The 
other is that our predictions depend on multiple ChlP-seq 
data, especially those of ERa's co-factors, which is not al- 
ways available in other types of cells. Overall, our work in- 
dicates that integrative analysis of ChlP-seq, Hi-C, ChlA- 
PET data, etc. could overcome the limits of each single 
method and provide a more comprehensive understanding 
of the chromatin interaction landscape at multi-scales. 

CONCLUSION 

In this work, we carried out a comprehensive analysis of 
ERa ChlA-PET data by combining gene expression, TF 
binding, histone modification profiles and open chromatin 
conformation data together. We showed that laERBSs were 
more likely to be nucleosome depleted compared with 
nlaERBSs. They were also significantly overlapping with 
FoxAl, GATA3, AP27, and p300 ChlP-seq binding peaks. 
An efficient classifier was developed to predict laERBSs and 
chromatin interactions between these laERBSs. Among all 
the features, the ERa ChlP-seq signal, distance of the neigh- 
boring ERBS and DNase-seq signals are the most predic- 
tive features for laERBSs. When predicting ERBS interac- 
tions, the restriction within Hi-C determined topological 
domains is very effective to filter many potential false pos- 
itives. The logistic classifier we trained can recover a large 
percentage (76%) of ChlA-PET experiment identified ERa 
interactions. Besides, many of our predicted novo ERBS in- 
teractions could be validated by independent 3C or other 
ChlA-PET data sets. The model was applied to predict the 
interactions between distal ERBS and promoter. We found 
that some newly predicted ERa target genes whose promot- 
ers did not overlap with ERBSs were associated with breast 
cancer related gene ontology items. Comparing with tradi- 
tional analysis of ChlP-seq and ChlA-PET data, our inte- 
grative analysis and predictive model can provide a better 
understanding of the chromatin long range interactions. 
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